fbeta_score (Fβ score)#

The Fβ score measures the quality of a binary classifier by combining precision and recall into a single number.

It generalizes the F1 score by letting you choose how much to weight recall relative to precision.

Learning goals#

  • Define Fβ from the confusion matrix (math + intuition)

  • Implement fbeta_score from scratch in NumPy (with edge cases)

  • Visualize how β and the decision threshold change the score (Plotly)

  • Use Fβ to optimize a simple classifier (threshold tuning + a smooth surrogate)

Quick import (reference)#

from sklearn.metrics import fbeta_score

Prerequisites#

  • Binary classification with labels in {0, 1} (we treat 1 as the positive class)

  • Confusion matrix terms: TP, FP, FN, TN

  • Basic NumPy

This notebook focuses on binary Fβ. For multiclass/multilabel, most libraries compute Fβ via one-vs-rest + averaging (micro/macro/weighted).

import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
pio.templates.default = "plotly_white"

rng = np.random.default_rng(0)
np.set_printoptions(precision=4, suppress=True)

Confusion matrix (binary)#

Let:

  • true labels: (y \in {0, 1})

  • predicted labels: (\hat{y} \in {0, 1})

  • 1 is the positive class

|         | (\hat{y}=1) | (\hat{y}=0) |
|---------|-------------|-------------|
| (y=1)   | TP          | FN          |
| (y=0)   | FP          | TN          |

Important: Fβ does not use TN. That’s a feature (when you care mostly about the positive class), but also a limitation.

def sigmoid(z):
    z = np.asarray(z, dtype=float)
    z = np.clip(z, -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-z))


def safe_divide(numer, denom, *, zero_division=0.0):
    """Elementwise numer/denom with a configurable value when denom == 0."""
    numer = np.asarray(numer, dtype=float)
    denom = np.asarray(denom, dtype=float)
    out = np.full_like(numer + denom, fill_value=float(zero_division), dtype=float)
    np.divide(numer, denom, out=out, where=denom != 0)
    return out


def confusion_counts_binary(y_true, y_pred, *, pos_label=1):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")

    y_true_pos = y_true == pos_label
    y_pred_pos = y_pred == pos_label

    tp = int(np.sum(y_true_pos & y_pred_pos))
    fp = int(np.sum(~y_true_pos & y_pred_pos))
    fn = int(np.sum(y_true_pos & ~y_pred_pos))
    tn = int(np.sum(~y_true_pos & ~y_pred_pos))
    return tp, fp, fn, tn


def precision_recall_fbeta_from_counts(tp, fp, fn, *, beta=1.0, zero_division=0.0):
    if beta <= 0:
        raise ValueError("beta must be > 0")
    beta2 = beta**2

    precision = float(safe_divide(tp, tp + fp, zero_division=zero_division))
    recall = float(safe_divide(tp, tp + fn, zero_division=zero_division))

    fbeta = float(
        safe_divide(
            (1.0 + beta2) * tp,
            (1.0 + beta2) * tp + beta2 * fn + fp,
            zero_division=zero_division,
        )
    )
    return precision, recall, fbeta


def precision_recall_fbeta(y_true, y_pred, *, beta=1.0, pos_label=1, zero_division=0.0):
    tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred, pos_label=pos_label)
    return precision_recall_fbeta_from_counts(tp, fp, fn, beta=beta, zero_division=zero_division)


def fbeta_score_numpy(y_true, y_pred, *, beta=1.0, pos_label=1, zero_division=0.0):
    _, _, fbeta = precision_recall_fbeta(
        y_true, y_pred, beta=beta, pos_label=pos_label, zero_division=zero_division
    )
    return fbeta

Precision, recall, and Fβ (math)#

Precision and recall are:

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \]

The Fβ score is a weighted harmonic mean of (P) and (R):

\[ F_\beta = \frac{(1+\beta^2)PR}{\beta^2 P + R} \]

A very useful confusion-matrix form is:

\[ F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP} \]

How β changes the trade-off

  • (\beta = 1) gives F1 (precision and recall weighted equally)

  • (\beta > 1) favors recall (it upweights FN by (\beta^2))

  • (\beta < 1) favors precision
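As a quick numeric check (standalone; the counts below are made up for illustration), the weighted-harmonic-mean form and the count form give exactly the same number:

```python
# Hypothetical counts, just for illustration.
tp, fp, fn = 8, 3, 5
beta = 2.0
b2 = beta**2

p = tp / (tp + fp)   # precision = 8/11
r = tp / (tp + fn)   # recall    = 8/13

# Weighted harmonic mean of P and R ...
f_from_pr = (1 + b2) * p * r / (b2 * p + r)
# ... and the direct confusion-matrix form.
f_from_counts = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

print(f_from_pr, f_from_counts)  # both equal 40/63 ≈ 0.6349
```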

y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0])

for beta in [0.5, 1.0, 2.0]:
    p, r, f = precision_recall_fbeta(y_true, y_pred, beta=beta)
    print(f"beta={beta:>3}: precision={p:.3f}, recall={r:.3f}, Fbeta={f:.3f}")

try:
    from sklearn.metrics import fbeta_score as skl_fbeta_score

    print("\nscikit-learn check:")
    for beta in [0.5, 1.0, 2.0]:
        print(f"beta={beta:>3}: sklearn={skl_fbeta_score(y_true, y_pred, beta=beta):.3f}")
except Exception as e:
    print("\n(scikit-learn not available for comparison)")
    print("Reason:", repr(e))
beta=0.5: precision=0.667, recall=0.667, Fbeta=0.667
beta=1.0: precision=0.667, recall=0.667, Fbeta=0.667
beta=2.0: precision=0.667, recall=0.667, Fbeta=0.667
scikit-learn check:
beta=0.5: sklearn=0.667
beta=1.0: sklearn=0.667
beta=2.0: sklearn=0.667

Note: in this example precision and recall happen to be equal (both 2/3). When (P = R), the weighted harmonic mean reduces to that common value for every β, which is why all three scores coincide; β only matters when precision and recall differ.

Scores vs labels: the role of the decision threshold#

Many models output a score or probability (s(x) \in [0,1]), then convert it to a label using a threshold (t):

\[ \hat{y}(x) = \mathbb{1}[s(x) \ge t] \]

  • Increasing (t) usually increases precision (fewer predicted positives) but decreases recall.

  • Since Fβ depends on TP/FP/FN, it depends on the choice of (t).

A very common workflow is:

  1. Train a model with a differentiable loss (e.g., log loss)

  2. Choose (t) on a validation set to maximize your target metric (e.g., F2)

def pr_fbeta_curve(y_true, y_score, *, beta=1.0, thresholds=None, zero_division=0.0):
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 301)
    thresholds = np.asarray(thresholds, dtype=float)

    pred_pos = y_score[:, None] >= thresholds[None, :]
    y_pos = (y_true == 1)[:, None]

    tp = np.sum(pred_pos & y_pos, axis=0)
    fp = np.sum(pred_pos & ~y_pos, axis=0)
    fn = np.sum(~pred_pos & y_pos, axis=0)

    precision = safe_divide(tp, tp + fp, zero_division=zero_division)
    recall = safe_divide(tp, tp + fn, zero_division=zero_division)

    beta2 = beta**2
    fbeta = safe_divide(
        (1.0 + beta2) * tp,
        (1.0 + beta2) * tp + beta2 * fn + fp,
        zero_division=zero_division,
    )

    return thresholds, precision, recall, fbeta, tp, fp, fn


def best_threshold_for_fbeta(y_true, y_score, *, beta=1.0, thresholds=None, zero_division=0.0):
    thresholds, precision, recall, fbeta, tp, fp, fn = pr_fbeta_curve(
        y_true, y_score, beta=beta, thresholds=thresholds, zero_division=zero_division
    )
    i = int(np.nanargmax(fbeta))
    return {
        "threshold": float(thresholds[i]),
        "fbeta": float(fbeta[i]),
        "precision": float(precision[i]),
        "recall": float(recall[i]),
        "tp": int(tp[i]),
        "fp": int(fp[i]),
        "fn": int(fn[i]),
        "index": i,
        "thresholds": thresholds,
        "precision_curve": precision,
        "recall_curve": recall,
        "fbeta_curve": fbeta,
    }
# Toy example: probability-like scores with overlap + class imbalance
n_pos, n_neg = 180, 820
y_true = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
y_score = np.r_[rng.beta(5, 2, size=n_pos), rng.beta(2, 5, size=n_neg)]

perm = rng.permutation(len(y_true))
y_true, y_score = y_true[perm], y_score[perm]

thresholds = np.linspace(0.0, 1.0, 301)
_, precision, recall, _, _, _, _ = pr_fbeta_curve(y_true, y_score, beta=1.0, thresholds=thresholds)

best_05 = best_threshold_for_fbeta(y_true, y_score, beta=0.5, thresholds=thresholds)
best_1 = best_threshold_for_fbeta(y_true, y_score, beta=1.0, thresholds=thresholds)
best_2 = best_threshold_for_fbeta(y_true, y_score, beta=2.0, thresholds=thresholds)

best_05["threshold"], best_1["threshold"], best_2["threshold"]
(0.6266666666666667, 0.5333333333333333, 0.5)
fig = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.12,
    subplot_titles=("Precision & recall vs threshold", "Fβ vs threshold"),
)

fig.add_trace(go.Scatter(x=thresholds, y=precision, mode="lines", name="precision"), row=1, col=1)
fig.add_trace(go.Scatter(x=thresholds, y=recall, mode="lines", name="recall"), row=1, col=1)

for beta, best in [(0.5, best_05), (1.0, best_1), (2.0, best_2)]:
    fig.add_trace(
        go.Scatter(
            x=best["thresholds"],
            y=best["fbeta_curve"],
            mode="lines",
            name=f"F{beta:g}",
        ),
        row=2,
        col=1,
    )
    fig.add_vline(
        x=best["threshold"],
        line_width=1,
        line_dash="dot",
        line_color="gray",
        row="all",
        col=1,
    )

fig.update_xaxes(title_text="threshold t", row=2, col=1)
fig.update_yaxes(title_text="value", row=1, col=1)
fig.update_yaxes(title_text="Fβ", row=2, col=1)
fig.update_layout(height=700, legend_orientation="h")
fig.show()

Precision–recall curve + iso-Fβ lines#

As you sweep the threshold (t), you trace out a curve in (recall, precision) space.

For a fixed (\beta), you can also draw iso-Fβ curves. Points on higher iso-curves have higher Fβ.
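To draw an iso-curve for a fixed score (F), solve the definition of (F_\beta) for recall as a function of precision:

\[ R(P) = \frac{F\,\beta^2\,P}{(1+\beta^2)\,P - F} \]

which is valid where the denominator is positive, i.e. for (P > F/(1+\beta^2)). This is exactly the expression used for `r` and `denom` in the cell below.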

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=recall,
        y=precision,
        mode="markers+lines",
        marker=dict(
            color=thresholds,
            colorscale="Viridis",
            showscale=True,
            colorbar=dict(title="threshold"),
            size=6,
        ),
        name="PR (threshold sweep)",
    )
)

# Iso-Fβ lines for beta=2
beta_iso = 2.0
beta2 = beta_iso**2
p_grid = np.linspace(1e-3, 1.0, 400)
for F in [0.2, 0.4, 0.6, 0.8]:
    denom = (1.0 + beta2) * p_grid - F
    r = (F * beta2 * p_grid) / denom
    mask = (denom > 0) & (r >= 0) & (r <= 1)
    fig.add_trace(
        go.Scatter(
            x=r[mask],
            y=p_grid[mask],
            mode="lines",
            line=dict(width=1, dash="dot"),
            name=f"iso-F{beta_iso:g}={F}",
            opacity=0.8,
        )
    )

fig.update_layout(
    title="Precision–Recall curve with iso-F2 lines",
    xaxis_title="recall",
    yaxis_title="precision",
    height=600,
)
fig.update_xaxes(range=[0, 1])
fig.update_yaxes(range=[0, 1])
fig.show()

Using Fβ to optimize a simple classifier#

Two practical ways to “optimize for Fβ” are:

  1. Train a model with a standard loss (e.g., log loss), then tune the threshold (t) to maximize Fβ on a validation set.

  2. Optimize a smooth surrogate of Fβ (use probabilities instead of hard labels) with gradient-based methods.

We’ll do both with a from-scratch logistic regression.

def find_bias_for_target_rate(logits, target_rate, *, iters=60):
    lo, hi = -20.0, 20.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        rate = sigmoid(logits + mid).mean()
        if rate > target_rate:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def make_synthetic_logistic_data(n=4000, *, target_pos_rate=0.15, seed=0):
    rng_local = np.random.default_rng(seed)
    X = rng_local.normal(size=(n, 2))
    true_w = np.array([2.0, -1.2])
    base_logits = X @ true_w
    true_b = find_bias_for_target_rate(base_logits, target_pos_rate)
    probs = sigmoid(base_logits + true_b)
    y = rng_local.binomial(1, probs).astype(int)
    return X, y, probs, (true_w, true_b)


def train_val_test_split(X, y, *, ratios=(0.6, 0.2, 0.2), seed=0):
    if not np.isclose(sum(ratios), 1.0):
        raise ValueError("ratios must sum to 1")
    rng_local = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng_local.permutation(n)
    X, y = X[perm], y[perm]

    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    X_train, y_train = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train : n_train + n_val], y[n_train : n_train + n_val]
    X_test, y_test = X[n_train + n_val :], y[n_train + n_val :]
    return X_train, y_train, X_val, y_val, X_test, y_test


X, y, _, (true_w, true_b) = make_synthetic_logistic_data(n=4000, target_pos_rate=0.12, seed=1)
X_train, y_train, X_val, y_val, X_test, y_test = train_val_test_split(X, y, seed=1)

print("positive rate (train/val/test):", y_train.mean(), y_val.mean(), y_test.mean())
positive rate (train/val/test): 0.12 0.11625 0.115
def add_intercept(X):
    X = np.asarray(X, dtype=float)
    return np.c_[np.ones((X.shape[0], 1)), X]


def log_loss_and_grad(w, Xb, y, *, l2=0.0, eps=1e-12):
    z = Xb @ w
    p = sigmoid(z)

    y = y.astype(float)
    loss = -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
    if l2:
        loss += 0.5 * l2 * np.sum(w[1:] ** 2)

    grad = (Xb.T @ (p - y)) / Xb.shape[0]
    if l2:
        grad[1:] += l2 * w[1:]
    return loss, grad


def fit_logistic_regression_ce(X, y, *, lr=0.2, steps=1500, l2=0.0, seed=0):
    rng_local = np.random.default_rng(seed)
    Xb = add_intercept(X)
    w = rng_local.normal(scale=0.01, size=Xb.shape[1])

    history = []
    for step in range(steps):
        loss, grad = log_loss_and_grad(w, Xb, y, l2=l2)
        w -= lr * grad
        if step % 20 == 0 or step == steps - 1:
            history.append((step, loss))
    return w, np.array(history)


w_ce, hist_ce = fit_logistic_regression_ce(X_train, y_train, lr=0.3, steps=1200, l2=1e-3, seed=1)
hist_ce[:5], hist_ce[-5:]
(array([[ 0.    ,  0.6936],
        [20.    ,  0.3378],
        [40.    ,  0.2818],
        [60.    ,  0.2603],
        [80.    ,  0.249 ]]),
 array([[1120.    ,    0.2243],
        [1140.    ,    0.2243],
        [1160.    ,    0.2243],
        [1180.    ,    0.2243],
        [1199.    ,    0.2243]]))
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_ce[:, 0], y=hist_ce[:, 1], mode="lines", name="log loss"))
fig.update_layout(title="Logistic regression training (cross-entropy)", xaxis_title="step", yaxis_title="log loss")
fig.show()
def predict_proba(w, X):
    Xb = add_intercept(X)
    return sigmoid(Xb @ w)


def evaluate_thresholded(y_true, y_score, *, threshold, beta=1.0):
    y_pred = (y_score >= threshold).astype(int)
    tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred)
    precision, recall, fbeta = precision_recall_fbeta_from_counts(tp, fp, fn, beta=beta)
    return {
        "threshold": float(threshold),
        "beta": float(beta),
        "precision": float(precision),
        "recall": float(recall),
        "fbeta": float(fbeta),
        "tp": int(tp),
        "fp": int(fp),
        "fn": int(fn),
        "tn": int(tn),
    }


val_scores_ce = predict_proba(w_ce, X_val)
test_scores_ce = predict_proba(w_ce, X_test)

betas = [0.5, 1.0, 2.0]
rows = []
for beta in betas:
    best = best_threshold_for_fbeta(y_val, val_scores_ce, beta=beta, thresholds=np.linspace(0, 1, 501))
    test_eval = evaluate_thresholded(y_test, test_scores_ce, threshold=best["threshold"], beta=beta)
    rows.append({
        "beta": beta,
        "best_val_threshold": best["threshold"],
        "val_fbeta": best["fbeta"],
        "test_precision": test_eval["precision"],
        "test_recall": test_eval["recall"],
        "test_fbeta": test_eval["fbeta"],
    })

try:
    import pandas as pd

    results = pd.DataFrame(rows)
except Exception:
    results = rows
results  # last top-level expression, so the notebook actually displays it

Notice how the optimal threshold changes with β.

  • With (\beta > 1), you typically pick a lower threshold to increase recall.

  • With (\beta < 1), you often pick a higher threshold to increase precision.

thresholds = np.linspace(0.0, 1.0, 501)

fig = go.Figure()
for beta in [0.5, 1.0, 2.0]:
    best = best_threshold_for_fbeta(y_val, val_scores_ce, beta=beta, thresholds=thresholds)
    fig.add_trace(go.Scatter(x=thresholds, y=best["fbeta_curve"], mode="lines", name=f"val F{beta:g}"))
    fig.add_trace(
        go.Scatter(
            x=[best["threshold"]],
            y=[best["fbeta"]],
            mode="markers",
            marker=dict(size=10),
            name=f"best t for F{beta:g}",
        )
    )

fig.update_layout(
    title="Validation: Fβ vs threshold (same model, different β)",
    xaxis_title="threshold",
    yaxis_title="Fβ",
    height=500,
)
fig.show()

Direct optimization (optional): a differentiable “soft Fβ” surrogate#

Hard Fβ uses thresholded predictions (\hat{y} \in {0,1}), so it’s not differentiable in the model parameters.

A common trick is to replace (\hat{y}) with the model probability (p\in[0,1]) and define “soft” counts:

\[ \widetilde{TP} = \sum_i y_i p_i, \quad \widetilde{FP} = \sum_i (1-y_i)p_i, \quad \widetilde{FN} = \sum_i y_i (1-p_i) \]

Then plug them into the same formula:

\[ \widetilde{F}_\beta = \frac{(1+\beta^2)\widetilde{TP}}{(1+\beta^2)\widetilde{TP} + \beta^2\widetilde{FN} + \widetilde{FP}} \]

This surrogate is smooth in (p), so we can do gradient ascent on a logistic regression model.

Caveat: optimizing (\widetilde{F}_\beta) is not identical to optimizing the hard-thresholded Fβ, but it can be a useful demonstration (and sometimes a practical heuristic).

def soft_fbeta_and_grad(w, Xb, y, *, beta=2.0, l2=0.0, eps=1e-12):
    """Return (soft_fbeta, grad_w) for a logistic regression model p = sigmoid(Xb @ w)."""
    if beta <= 0:
        raise ValueError("beta must be > 0")
    beta2 = beta**2

    z = Xb @ w
    p = sigmoid(z)
    y = y.astype(float)

    tp = np.sum(y * p)
    sp = np.sum(p)  # tp + fp
    pos = np.sum(y)
    denom = sp + beta2 * pos + eps
    f = (1.0 + beta2) * tp / denom

    # dF/dp_i
    dF_dp = (1.0 + beta2) * (y * denom - tp) / (denom**2)
    dF_dz = dF_dp * p * (1.0 - p)
    grad = Xb.T @ dF_dz

    if l2:
        f -= 0.5 * l2 * np.sum(w[1:] ** 2)
        grad[1:] -= l2 * w[1:]

    return float(f), grad


def fit_logistic_regression_soft_fbeta(X, y, *, beta=2.0, lr=1e-3, steps=4000, l2=1e-3, seed=0):
    rng_local = np.random.default_rng(seed)
    Xb = add_intercept(X)
    w = rng_local.normal(scale=0.01, size=Xb.shape[1])

    history = []
    for step in range(steps):
        f, grad = soft_fbeta_and_grad(w, Xb, y, beta=beta, l2=l2)
        w += lr * grad
        if step % 50 == 0 or step == steps - 1:
            history.append((step, f))
    return w, np.array(history)


beta_opt = 2.0
w_soft, hist_soft = fit_logistic_regression_soft_fbeta(
    X_train, y_train, beta=beta_opt, lr=2e-3, steps=6000, l2=1e-3, seed=2
)
hist_soft[:5], hist_soft[-5:]
(array([[  0.    ,   0.3058],
        [ 50.    ,   0.3105],
        [100.    ,   0.3151],
        [150.    ,   0.3197],
        [200.    ,   0.3242]]),
 array([[5800.    ,    0.5   ],
        [5850.    ,    0.5005],
        [5900.    ,    0.501 ],
        [5950.    ,    0.5015],
        [5999.    ,    0.502 ]]))
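As an optional sanity check (a standalone re-implementation of the soft-Fβ math above, without the L2 term), the analytic gradient can be compared against central finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def soft_fbeta(w, Xb, y, beta=2.0):
    p = sigmoid(Xb @ w)
    b2 = beta**2
    tp = np.sum(y * p)
    denom = np.sum(p) + b2 * np.sum(y)   # sp + beta^2 * pos
    return (1.0 + b2) * tp / denom

def soft_fbeta_grad(w, Xb, y, beta=2.0):
    p = sigmoid(Xb @ w)
    b2 = beta**2
    tp = np.sum(y * p)
    denom = np.sum(p) + b2 * np.sum(y)
    dF_dp = (1.0 + b2) * (y * denom - tp) / denom**2
    return Xb.T @ (dF_dp * p * (1.0 - p))   # chain rule through the sigmoid

rng = np.random.default_rng(0)
Xb = np.c_[np.ones(50), rng.normal(size=(50, 2))]
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)

g = soft_fbeta_grad(w, Xb, y)
eps = 1e-6
g_fd = np.array([
    (soft_fbeta(w + eps * e, Xb, y) - soft_fbeta(w - eps * e, Xb, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(g - g_fd)))  # should be tiny (roughly 1e-8 or smaller)
```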
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_soft[:, 0], y=hist_soft[:, 1], mode="lines", name=f"soft F{beta_opt:g}"))
fig.update_layout(
    title=f"Logistic regression training (maximize soft F{beta_opt:g})",
    xaxis_title="step",
    yaxis_title=f"soft F{beta_opt:g}",
)
fig.show()
val_scores_soft = predict_proba(w_soft, X_val)
test_scores_soft = predict_proba(w_soft, X_test)

best_ce = best_threshold_for_fbeta(y_val, val_scores_ce, beta=beta_opt, thresholds=np.linspace(0, 1, 501))
best_soft = best_threshold_for_fbeta(y_val, val_scores_soft, beta=beta_opt, thresholds=np.linspace(0, 1, 501))

test_ce = evaluate_thresholded(y_test, test_scores_ce, threshold=best_ce["threshold"], beta=beta_opt)
test_soft = evaluate_thresholded(y_test, test_scores_soft, threshold=best_soft["threshold"], beta=beta_opt)

rows = [
    {
        "model": "cross-entropy + threshold tuning",
        "val_best_threshold": best_ce["threshold"],
        "val_Fbeta": best_ce["fbeta"],
        "test_precision": test_ce["precision"],
        "test_recall": test_ce["recall"],
        "test_Fbeta": test_ce["fbeta"],
    },
    {
        "model": f"maximize soft F{beta_opt:g} + threshold tuning",
        "val_best_threshold": best_soft["threshold"],
        "val_Fbeta": best_soft["fbeta"],
        "test_precision": test_soft["precision"],
        "test_recall": test_soft["recall"],
        "test_Fbeta": test_soft["fbeta"],
    },
]

try:
    import pandas as pd

    results = pd.DataFrame(rows)
except Exception:
    results = rows
results  # last top-level expression, so the notebook actually displays it

Pros / cons and when to use Fβ#

Pros#

  • Focuses on the positive class (TP/FP/FN) — useful for imbalanced problems where TN is less informative

  • Adjustable trade-off via (\beta): pick recall-heavy ((\beta>1)) or precision-heavy ((\beta<1))

  • Single number summarizing the precision–recall trade-off (easy to compare models)

Cons#

  • Threshold-dependent: you must choose a threshold (or a policy) to get a meaningful number

  • Not a proper scoring rule: it does not reward well-calibrated probabilities the way log loss / Brier score do

  • Ignores TN: can be misleading when TN matters (e.g., overall error rate is critical)

  • Not smooth in the hard form (can’t be directly optimized with gradient descent without surrogates)
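
The "ignores TN" point is easy to demonstrate: appending any number of correctly classified negatives leaves F1 untouched, even though accuracy soars. A standalone sketch with a toy `f1` helper:

```python
import numpy as np

def f1(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fn + fp)

y_true = np.array([1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0])
f1_small = f1(y_true, y_pred)                    # TP=1, FP=1, FN=1 -> 0.5

extra = np.zeros(1000, dtype=int)                # 1000 extra true negatives
f1_big = f1(np.r_[y_true, extra], np.r_[y_pred, extra])

print(f1_small, f1_big)  # 0.5 0.5 -- identical, despite very different accuracy
```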

When it’s a good fit#

  • Information retrieval / search / recommendation (relevant items are “positive”)

  • Medical screening or safety monitoring (often (\beta>1) to favor recall)

  • Fraud / abuse detection (often (\beta<1) if false positives are expensive)

Pitfalls + diagnostics#

  • Pick (\beta) based on real costs (FN vs FP), not after looking at the test set.

  • Always tune thresholds on a validation set (or use cross-validation).

  • For highly imbalanced data, look at the full precision–recall curve; a single Fβ can hide failure modes.

  • Be explicit about the positive class (pos_label).

  • Decide how to handle zero-division cases (no predicted positives, or no actual positives).
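
A minimal sketch of the main zero-division situation (hypothetical labels): if the model predicts no positives but actual positives exist, precision is 0/0, yet the count form of Fβ is still well defined and equals 0. The truly undefined case is TP = FP = FN = 0 (no positives anywhere), which is when a `zero_division` policy is needed.

```python
import numpy as np

y_true = np.array([1, 0, 1, 0])
y_pred = np.zeros_like(y_true)                    # no predicted positives at all

tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # 0
fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # 0
fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # 2

b2 = 1.0  # beta = 1
denom = (1 + b2) * tp + b2 * fn + fp              # = 2, not zero
fbeta = (1 + b2) * tp / denom                     # = 0.0, no division by zero
print(fbeta)
```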

Exercises#

  1. Implement macro-averaged Fβ for multiclass classification via one-vs-rest.

  2. For a fixed model, show how the best threshold changes when (\beta\in{0.25,0.5,1,2,4}).

  3. Compare optimizing (a) log loss + threshold tuning vs (b) a soft-Fβ surrogate on a more extreme imbalance (e.g., 1% positives).

References#

  • scikit-learn fbeta_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html

  • scikit-learn metrics overview: https://scikit-learn.org/stable/modules/model_evaluation.html

  • C. J. van Rijsbergen, Information Retrieval (discussion of the (F_\beta) measure)